HW 03

Author

Weston Scott

1 - Du Bois challenge.

income <- read.csv("data/income.csv")

income <- income |>
    mutate(
        Average_Income = as.integer(Average_Income),
        ClassLabel = factor(
            paste0(Class, " | $", 
                   format(Average_Income, 
                          big.mark = ",")
                  ),
            levels = unique(paste0(Class, " | $", 
                               format(Average_Income, 
                                      big.mark = ",")))
            )
    ) |>

    pivot_longer(
        cols = Rent:Other, ## list slice-like syntax to get the ordered columns
        names_to = "Category", 
        values_to = "Percent"
    ) |>

    mutate(
        Category = factor(Category, 
                          levels = c("Other", 
                                     "Tax", 
                                     "Clothes", 
                                     "Food", 
                                     "Rent")),
        text_color = ifelse(Category == "Rent", 
                            "white", 
                            "black")
    ) |>

    group_by(ClassLabel) |>
    mutate(pos = cumsum(Percent) - Percent / 2) |>
    ungroup() |> glimpse()
Rows: 35
Columns: 7
$ Class          <chr> "$100-200", "$100-200", "$100-200", "$10…
$ Average_Income <int> 139, 139, 139, 139, 139, 249, 249, 249, …
$ ClassLabel     <fct> "$100-200 | $  139", "$100-200 | $  139"…
$ Category       <fct> Rent, Food, Clothes, Tax, Other, Rent, F…
$ Percent        <dbl> 19.0, 43.0, 28.0, 9.9, 0.1, 22.0, 47.0, …
$ text_color     <chr> "white", "black", "black", "black", "bla…
$ pos            <dbl> 9.50, 40.50, 76.00, 94.95, 99.95, 11.00,…
category_colors <- c(
    Rent = "black",
    Food = "slateblue4",
    Clothes = "rosybrown2",
    Tax = "gray60",
    Other = "tan"
)
du_bois <- ggplot(
    income, 
    aes(x = fct_rev(ClassLabel), 
        y = Percent, 
        fill = Category)
    ) +

    geom_col(color = "black", 
             width = 0.7) +

    geom_text(data = filter(income, 
                            Percent > 1),
              aes(label = paste0(round(Percent, 1), "%"), 
                  y = pos, 
                  color = text_color), 
              size = 2.5
    ) +
    scale_fill_manual(values = category_colors) +
    scale_color_manual(values = c("white" = "white", 
                                  "black" = "black")) +

    coord_flip() +
    scale_y_continuous(labels = NULL) +
  
    annotate("text", 
             x = c(1, 2.5, 4.5, 6.5), 
             y = 102, 
             label = c("Well-To-Do", 
                       "Comfortable", 
                       "Fair", 
                       "Poor"), 
             size = 2.5, 
             angle = 90) +
  
  labs(
      x = NULL, 
      y = NULL, 
      title = "Annual Expenditure For Provided Data",
      text_color = NULL
  ) +

  theme(
      axis.title = element_blank(),
      axis.text.y = element_text(face = "bold", 
                                 size = 8),
      panel.grid = element_blank(),
      legend.title = element_blank(),
      legend.position = "top",
      plot.title = element_text(hjust = 0.5, margin = margin(b = 10))
  ) +

  guides(fill = guide_legend(reverse = TRUE), 
         color = "none")

ggbackground(du_bois, "images/paper.jpg")

Resources

Utilized a few resources to help me understand how to do this (Wickham and Henry 2024), (Yu 2018).

2 - COVID survey - interpret

Observation 1

The first relationship I noticed was the question “Had flu vaccine this year”. I am understanding that the error bar lengths for the “No” response seem much longer than those of the “Yes” response for all top level COVID questions. It would be my assumption that those individuals that do not obtain the flu shot likely follow more information on the subject matter of whether or not the COVID vaccine is safe versus not safe. Those that get the flu shot appear to have responses overall that are more centrally located to the means, telling me that they are either not informed or maybe are simply not as concerned with the situation as compared to those that did not get the flu shot.

Observation 2

Looking at the “I trust information that I have received about the vaccines” column has a very small deviation from a localized mean across the board. Every combination appears to have show low confidence, leading to small error bars, with more survey results in the lower values for that question.

Observation 3

Something interesting that I think is quite visible with the provided image is that for the entire set of data, there are 2 question columns that tend to have the most diverse results, meaning the widest spread of answers, or the 10th and 90th percentile bars are on average the longest. The questions are “Based on my understanding, I believe the vaccine is safe” and “I am concerned about the safety and side effects of the vaccine.” I would say that these results would directly reflect information (or misinformation) dispersed to the masses. The length of the bars suggest that there are more people on either end of the spectrum for the questions then there are neutral responses.

Observation 4

A final observation I made looking at this dataset involved the age demographic against the “Based on my understanding, I believe the vaccine is safe”. There is are large error bars for each age group except the youngest group. I attribute that to simply youth not being as informed relative to the information that is being dispersed. The spread of the survey results for the youth is minimal.

3 - COVID survey - reconstruct

covid_survey <- read.csv("data/covid-survey.csv")

covid_survey <- covid_survey |> 
    row_to_names(row_number = 1) |>
    clean_names() |>
    mutate(
        across(everything(), 
               ~ na_if(., ""))
    ) |>

    filter(
        if_any(-response_id, 
               ~ !is.na(.))
    )
covid_survey <- covid_survey |>
    mutate(
        exp_already_vax = recode(exp_already_vax,
                                 "1" ="Yes", 
                                 "0" = "No"),
        exp_flu_vax = recode(exp_flu_vax,
                             "1" ="Yes", 
                             "0" = "No"),
        exp_profession = recode(exp_profession,
                                "1" = "Nursing", 
                                "0" = "Medical"),
        exp_gender = recode(exp_gender,
                            "0" = "Male",
                            "1" = "Female",
                            "3" = "Non-binary/Third gender",
                            "4" = "Prefer not to say"),
        exp_race = recode(exp_race,
                          "1" = "American Indian/Alaskan Native",
                          "2" = "Asian",
                          "3" = "Black/African American",
                          "4" = "Native Hawaiian/Other Pacific Islander",
                          "5" = "White"),
        exp_ethnicity = recode(exp_ethnicity,
                               "1" = "Hispanic/Latino",
                               "2" = "Non-Hispanic/Non-Latino"),
        exp_age_bin = recode(exp_age_bin,
                             "0" = "<20",
                             "20" = "21-25",
                             "25" = "26-30",
                             "30" = ">30")
    
    )
covid_survey_longer <- covid_survey |>
    pivot_longer(
        cols = starts_with("exp_"),
        names_to = "explanatory",
        values_to = "explanatory_value"
    ) |>
    filter(!is.na(explanatory_value)) |>
    pivot_longer(
        cols = starts_with("resp_"),
        names_to = "response",
        values_to = "response_value"
    )

covid_survey_longer
# A tibble: 43,428 × 5
   response_id explanatory    explanatory_value response         
   <chr>       <chr>          <chr>             <chr>            
 1 1           exp_profession Nursing           resp_safety      
 2 1           exp_profession Nursing           resp_confidence_…
 3 1           exp_profession Nursing           resp_concern_saf…
 4 1           exp_profession Nursing           resp_feel_safe_a…
 5 1           exp_profession Nursing           resp_will_recomm…
 6 1           exp_profession Nursing           resp_trust_info  
 7 1           exp_flu_vax    Yes               resp_safety      
 8 1           exp_flu_vax    Yes               resp_confidence_…
 9 1           exp_flu_vax    Yes               resp_concern_saf…
10 1           exp_flu_vax    Yes               resp_feel_safe_a…
# ℹ 43,418 more rows
# ℹ 1 more variable: response_value <chr>
covid_survey_summary_stats_by_group <- covid_survey_longer |>
    group_by(explanatory, explanatory_value, response) |>
    summarize(
        mean = mean(as.numeric(response_value), 
                    na.rm = TRUE),
        low  = quantile(as.numeric(response_value), 
                        probs = 0.1, 
                        na.rm = TRUE),
        high = quantile(as.numeric(response_value), 
                        probs = 0.9, 
                        na.rm = TRUE)
  )

covid_survey_summary_stats_by_group
# A tibble: 126 × 6
# Groups:   explanatory, explanatory_value [21]
   explanatory explanatory_value response        mean   low  high
   <chr>       <chr>             <chr>          <dbl> <dbl> <dbl>
 1 exp_age_bin 21-25             resp_concern_…  3.32     2     5
 2 exp_age_bin 21-25             resp_confiden…  1.31     1     2
 3 exp_age_bin 21-25             resp_feel_saf…  1.20     1     2
 4 exp_age_bin 21-25             resp_safety     1.95     1     5
 5 exp_age_bin 21-25             resp_trust_in…  1.29     1     2
 6 exp_age_bin 21-25             resp_will_rec…  1.09     1     1
 7 exp_age_bin 26-30             resp_concern_…  3.35     1     5
 8 exp_age_bin 26-30             resp_confiden…  1.40     1     2
 9 exp_age_bin 26-30             resp_feel_saf…  1.29     1     2
10 exp_age_bin 26-30             resp_safety     2.16     1     5
# ℹ 116 more rows
covid_survey_summary_stats_all <- covid_survey_longer |>
    group_by(response) |>
    summarize(
        mean = mean(as.numeric(response_value), 
                    na.rm = TRUE),
        low  = quantile(as.numeric(response_value), 
                        probs = 0.1, 
                        na.rm = TRUE),
        high = quantile(as.numeric(response_value), 
                        probs = 0.9, 
                        na.rm = TRUE),
        explanatory = "All",
        explanatory_value = ""

  )

covid_survey_summary_stats_all
# A tibble: 6 × 6
  response         mean   low  high explanatory explanatory_value
  <chr>           <dbl> <dbl> <dbl> <chr>       <chr>            
1 resp_concern_s…  3.28     1     5 All         ""               
2 resp_confidenc…  1.43     1     2 All         ""               
3 resp_feel_safe…  1.36     1     2 All         ""               
4 resp_safety      2.03     1     5 All         ""               
5 resp_trust_info  1.40     1     2 All         ""               
6 resp_will_reco…  1.21     1     2 All         ""               
covid_survey_summary_stats <- bind_rows(
    covid_survey_summary_stats_all,
    covid_survey_summary_stats_by_group
)
covid_survey_summary_stats
# A tibble: 132 × 6
   response        mean   low  high explanatory explanatory_value
   <chr>          <dbl> <dbl> <dbl> <chr>       <chr>            
 1 resp_concern_…  3.28     1     5 All         ""               
 2 resp_confiden…  1.43     1     2 All         ""               
 3 resp_feel_saf…  1.36     1     2 All         ""               
 4 resp_safety     2.03     1     5 All         ""               
 5 resp_trust_in…  1.40     1     2 All         ""               
 6 resp_will_rec…  1.21     1     2 All         ""               
 7 resp_concern_…  3.32     2     5 exp_age_bin "21-25"          
 8 resp_confiden…  1.31     1     2 exp_age_bin "21-25"          
 9 resp_feel_saf…  1.20     1     2 exp_age_bin "21-25"          
10 resp_safety     1.95     1     5 exp_age_bin "21-25"          
# ℹ 122 more rows
response_labels <- c(
    "resp_safety" = "Based on my\nunderstanding, I\nbelieve the vaccine\nis safe",
    "resp_feel_safe_at_work" = "Getting the vaccine\nwill make me feel\nsafer at work",
    "resp_concern_safety" = "I am concerned\nabout the safety\nand side effects of\nthe vaccine",
    "resp_confidence_science" = "I am confident in\nthe scientific\nvetting process for\nthe new COVID\nvaccines",
    "resp_trust_info" = "I trust the\ninformation that I\nhave received about\nthe vaccines",
    "resp_will_recommend" = "I will recommend\nthe vaccine to\nfamily, friends,\nand community\nmembers"
)

explanatory_labels <- c(
    "All" = "All",
    "exp_age_bin" = "Age",
    "exp_gender" = "Gender",
    "exp_race" = "Race",
    "exp_ethnicity" = "Ethnicity",
    "exp_profession" = "Profession",
    "exp_already_vax" = "Had COVID vaccine",
    "exp_flu_vax" = "Had flu vaccine this year"
)

covid_survey_summary_stats <- covid_survey_summary_stats |>
    mutate(
        response = factor(response, 
                          levels = names(response_labels),
                          labels = unname(response_labels)),
        explanatory = factor(explanatory, 
                             levels = names(explanatory_labels),
                             labels = unname(explanatory_labels))
    ) 
ggplot(
    covid_survey_summary_stats, 
    aes(x = mean, 
        y = explanatory_value, 
        xmin = low, 
        xmax = high)
) +

geom_pointrange(size = 0.1) +
geom_errorbar(
    aes(xmin = low, 
        xmax = high), 
        width = 0.3
) +

facet_grid(
    explanatory ~ response, 
    labeller = labeller(
        explanatory = explanatory_labels,
        response = response_labels
    ),
    space = "free_y",
    scales = "free_y"
) +

scale_x_continuous(breaks = 1:5, 
                   limits = c(1,5)) +

labs(x = "Mean Likert score\n(Error bars range from 10th to 90th percentile)",
     y = NULL) +

theme_minimal(base_size = 8) +

theme(
    strip.background = element_rect(fill = "gray90", 
                                    color = 'black', 
                                    size = 0.3),
    strip.text.y.right = element_text(angle = 0),
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
) +

coord_cartesian(clip = "off")

Resources

Utilized a few resources to help me understand how to do this (Wickham and Henry 2024), (Yu 2018), (Wickham, François, et al. 2024), (Wickham, Chang, et al. 2024), (GeeksforGeeks 2021).

4 - COVID survey - re-reconstruct

covid_survey_summary_stats_all_quartile <- covid_survey_longer |>
    group_by(response) |>
    summarize(
        mean = mean(as.numeric(response_value), 
                    na.rm = TRUE),
        low  = quantile(as.numeric(response_value), 
                        probs = 0.25, 
                        na.rm = TRUE),
        high = quantile(as.numeric(response_value), 
                        probs = 0.75, 
                        na.rm = TRUE),
        explanatory = "All",
        explanatory_value = ""

  )

covid_survey_summary_stats_by_group_quartile <- covid_survey_longer |>
    group_by(explanatory, explanatory_value, response) |>
    summarize(
        mean = mean(as.numeric(response_value), 
                    na.rm = TRUE),
        low  = quantile(as.numeric(response_value), 
                        probs = 0.25, 
                        na.rm = TRUE),
        high = quantile(as.numeric(response_value), 
                        probs = 0.75, 
                        na.rm = TRUE)
  )

covid_survey_summary_stats_quartile <- bind_rows(
    covid_survey_summary_stats_all_quartile,
    covid_survey_summary_stats_by_group_quartile
) |>
    mutate(
        response = factor(response, 
                          levels = names(response_labels),
                          labels = unname(response_labels)),
        explanatory = factor(explanatory, 
                             levels = names(explanatory_labels),
                             labels = unname(explanatory_labels))
    ) 

covid_survey_summary_stats_quartile
# A tibble: 132 × 6
   response        mean   low  high explanatory explanatory_value
   <fct>          <dbl> <dbl> <dbl> <fct>       <chr>            
 1 "I am concern…  3.28     2     4 All         ""               
 2 "I am confide…  1.43     1     2 All         ""               
 3 "Getting the …  1.36     1     1 All         ""               
 4 "Based on my\…  2.03     1     3 All         ""               
 5 "I trust the\…  1.40     1     2 All         ""               
 6 "I will recom…  1.21     1     1 All         ""               
 7 "I am concern…  3.32     2     4 Age         "21-25"          
 8 "I am confide…  1.31     1     2 Age         "21-25"          
 9 "Getting the …  1.20     1     1 Age         "21-25"          
10 "Based on my\…  1.95     1     2 Age         "21-25"          
# ℹ 122 more rows
ggplot(
    covid_survey_summary_stats_quartile, 
    aes(x = mean, 
        y = explanatory_value, 
        xmin = low, 
        xmax = high)
) +

geom_pointrange(size = 0.1) +
geom_errorbar(
    aes(xmin = low, 
        xmax = high), 
        width = 0.3
) +

facet_grid(
    explanatory ~ response, 
    labeller = labeller(
        explanatory = explanatory_labels,
        response = response_labels
    ),
    space = "free_y",
    scales = "free_y"
) +

scale_x_continuous(breaks = 1:5, 
                   limits = c(1,5)) +

labs(x = "Mean Likert score\n(Error bars range from 25th to 75th percentile)",
     y = NULL) +

theme_minimal(base_size = 8) +

theme(
    strip.background = element_rect(fill = "gray90", 
                                    color = 'black', 
                                    size = 0.3),
    strip.text.y.right = element_text(angle = 0),
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
) +

coord_cartesian(clip = "off")

Resources

Utilized a few resources to help me understand how to do this (Wickham and Henry 2024), (Yu 2018), (Wickham, François, et al. 2024), (Wickham, Chang, et al. 2024), (GeeksforGeeks 2021).

Conclusions

This new figure with quarter percentiles calculated appears vastly different. By looking at only the middle 50% of the of the responses, the outer extremes that can lengthen the error bars is minimized. Now the error bars all around are much smaller comparatively.

Observation 1

The error bars a smaller for the responses to the question “Had flu vaccine this year”. The variability in each answer appears less pronounced. The general trend I observed still stands, non-flu-vaccinated individuals seem to be more skeptical but there is less emphasis on response spread.

Oberservation 2

The error bars for the column “I trust information that I have received about the vaccines” got even smaller, reducing the variability even more, which reinforced the initial observation.

Oberservation 3

With smaller error bars again the 2 columns “Based on my understanding, I believe the vaccine is safe” and “I am concerned about the safety and side effects of the vaccine” both have reduced variability, which might revise the original observation that the response spectrum is so extreme.

5 - COVID survey - another view

covid_survey_longer
# A tibble: 43,428 × 5
   response_id explanatory    explanatory_value response         
   <chr>       <chr>          <chr>             <chr>            
 1 1           exp_profession Nursing           resp_safety      
 2 1           exp_profession Nursing           resp_confidence_…
 3 1           exp_profession Nursing           resp_concern_saf…
 4 1           exp_profession Nursing           resp_feel_safe_a…
 5 1           exp_profession Nursing           resp_will_recomm…
 6 1           exp_profession Nursing           resp_trust_info  
 7 1           exp_flu_vax    Yes               resp_safety      
 8 1           exp_flu_vax    Yes               resp_confidence_…
 9 1           exp_flu_vax    Yes               resp_concern_saf…
10 1           exp_flu_vax    Yes               resp_feel_safe_a…
# ℹ 43,418 more rows
# ℹ 1 more variable: response_value <chr>
covid_bars <- covid_survey_longer |>
    filter(response %in% names(response_labels)) |>
    mutate(
        response_value = recode(response_value,
                                "1" = "Strongly Disagree",
                                "2" = "Disagree",
                                "3" = "Neutral",
                                "4" = "Agree",
                                "5" = "Strongly Agree"),
        response = factor(response, 
                          levels = names(response_labels),
                          labels = unname(response_labels))
    ) |>

    group_by(response, 
             response_value) |>

    summarise(n = n(), 
              .groups = "drop") |>

    group_by(response) |>
    mutate(
        percent = n / sum(n) * 100,
        direction = case_when(response_value %in% c("Strongly Disagree", 
                                                    "Disagree") ~ "Negative",
                              response_value == "Neutral" ~ "Neutral",
                              response_value %in% c("Agree", 
                                                    "Strongly Agree") ~ "Positive"),
        percent_adj = case_when(direction == "Negative" ~ -percent,
                                TRUE ~ percent)
        )

colorblind_colors = c("#44AA99", "#CC6677", "#DDCC77", "#117733", "#882255")

Colors

I picked colors (Paul Tol palette) from this website to enhance visualization for colorblindness (Nichols 2020)

ggplot(
    covid_bars, 
    aes(x = percent_adj,
        y = fct_rev(response),
        fill = response_value)
) +
  
geom_col() +
scale_fill_manual(values = colorblind_colors) +
labs(
    title = "Diverging Bar Chart of COVID Survey Responses",
    x = "Percentage",
    y = "Survey Question",
    fill = "Response"
) +

theme_minimal()

A diverging bar chart showing six COVID survey questions on the y-axis. Bars extending left represent 'Strongly Disagree' and 'Disagree' responses (with red shades), while bars extending right show 'Neutral', 'Agree', and 'Strongly Agree' responses (with green shades). The length of each bar indicates the percentage of respondents choosing that option.

Alt Text

A diverging bar chart showing six COVID survey questions on the y-axis. Bars extending left represent ‘Strongly Disagree’ and ‘Disagree’ responses (with red shades), while bars extending right show ‘Neutral’, ‘Agree’, and ‘Strongly Agree’ responses (with green shades). The length of each bar indicates the percentage of respondents choosing that option.

ggplot(
    covid_bars, 
    aes(x = fct_rev(response),
        y = percent, 
        fill = response_value)
) +

geom_col(position = "fill") +
scale_fill_manual(values = colorblind_colors) +
labs(
    title = "100% Stacked Bar Chart of COVID Survey Responses",
    x = "Survey Question", 
    y = "Percentage", 
    fill = "Response"
) +

coord_flip() +
theme_minimal()

A 100% stacked bar chart with six COVID survey questions on the y-axis. Each horizontal bar sums to 100%, divided into color-coded segments representing 'Strongly Disagree' to 'Strongly Agree'. Red shades indicate disagreement, yellow is neutral, and green shades show agreement. The chart highlights response distribution within each question.

Alt Text

A 100% stacked bar chart with six COVID survey questions on the y-axis. Each horizontal bar sums to 100%, divided into color-coded segments representing ‘Strongly Disagree’ to ‘Strongly Agree’. Red shades indicate disagreement, yellow is neutral, and green shades show agreement. The chart highlights response distribution within each question.

Part C Comparison

Diverging Bar Charts are best for visualizing overall sentiment (positive vs. negative) and the intensity of opinion between questions while the 100% Stacked Bar Charts are good for comparing response patterns within each question, but it’s harder to see if total agreement is higher for one question vs another. I am more familiar with the 100% stacked bar charts, as I prefer to see the full spread that each response covers over the whole data, as opposes to the ‘intensity’ of opinion that is more prevalent in the diverging bar chart.

References

GeeksforGeeks. 2021. “How to Calculate Quantiles by Group in r?” https://www.geeksforgeeks.org/how-to-calculate-quantiles-by-group-in-r/.
Nichols, David. 2020. “Color Palette Accessibility Checker.” https://davidmathlogic.com/colorblind/#%23332288-%23117733-%2344AA99-%2388CCEE-%23DDCC77-%23CC6677-%23AA4499-%23882255.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2024. “Geom_linerange: Ggplot2 Reference Manual.” https://ggplot2.tidyverse.org/reference/geom_linerange.html.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2024. “Bind_rows: Dplyr Reference Manual.” https://dplyr.tidyverse.org/reference/bind_rows.html.
Wickham, Hadley, and Lionel Henry. 2024. “Pivot_longer: Tidyr Reference Manual.” https://tidyr.tidyverse.org/reference/pivot_longer.html.
Yu, Guangchuang. 2018. “Setting Ggplot2 Background with ggbackground().” Blog post. https://guangchuangyu.github.io/2018/04/setting-ggplot2-background-with-ggbackground/.